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Abstract 


The priority queue is a fundamental data structure that is used in a large variety of 
parallel algorithms, such as multiprocessor scheduling and parallel best-first search of 
state-space graphs. This thesis addresses the design and experimental evaluation of two 
novel concurrent priority queues: a parallel Fibonacci heap and a concurrent priority 
pool, and compares them with the concurrent binary heap. The parallel Fibonacci heap 
is based on the sequential Fibonacci heap, which is theoretically the most efficient data 
structure for sequential priority queues. This scheme not only preserves the efficient 
operation time bounds of its sequential counterpart, but also has very low contention 
by distributing locks over the entire data structure. The experimental results show its 
linearly scalable throughput and speedup up to as many processors as tested (currently 
18). A concurrent access scheme for a doubly linked list is described as part of the 
implementation of the parallel Fibonacci heap. The concurrent priority pool is based 
on the concurrent B-tree and the concurrent pool. The concurrent priority pool has the 
highest throughput among the priority queues studied. Like the parallel Fibonacci heap, 
the concurrent priority pool scales linearly up to as many processors as tested. The 
priority queues are evaluated in terms of throughput and speedup. Some applications of 
concurrent priority queues such as the vertex cover problem and the single source shortest 
path problem are tested. 
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Chapter 1 
Introduction 


The priority queue is a fundamental data structure that is used in a large variety of 
parallel algorithms, such as multiprocessor scheduling and parallel best-first search of 
state-space graphs[Win84, Nil80, Pea84, KRR88]. In these algorithms, each process 
performs an access-think cycle. Every process works on its current node (thinking), 
then accesses the shared priority queue to insert nodes if it generated any, eztract a 
high priority node to work on next, increase the priorities of some nodes by decreasing 
the keys!, and delete some nodes from the priority queue if they no longer need to be 
worked on. Sequential priority queues are usually represented as binary heaps, Fibonacci 
heaps, or B-trees (see Chapter 2). Concurrent priority queues are used in a large number 
of parallel algorithms. An example is Seneff’s speech recognition parser[Sen89], which 
maintains a priority queue of unparsed grammar nodes with associated priorities, and 
parses grammar nodes with higher priorities first. 

We call the extract operation of a concurrent priority queue strict if it extracts the 
element with the highest priority in the queue. Strict extract operations require some kind 
of serialization of operations performed on a queue, which increases the contention on 
the queue. As discussed in section 3.1, most applications only need to extract promising 
elements that have high priority instead of the highest priority; this fact can be used to 
decrease contention on the priority queue. However, the promising quality of extracted 
nodes should be controlled to satisfy the requirements of different applications. 

Biswas and Browne [BB87] present a scheme that allows parallel insertions and ex- 
tractions in strict concurrent binary heaps, but it does not perform better than the serial 
access scheme even for heaps with 1,000 nodes. In the serial access scheme, each operation 


Tn this thesis, we use small keys to denote high priority. 


locks the binary heap exclusively during the whole period of the operation. Rao and Ku- 
mar [RK88b] describe a concurrent binary heap algorithm for concurrent priority queues 
that has less overhead and provides strict extract operations. However, their scheme sat- 
urates when the number of processes accessing the priority queue is greater than about 
ten?. More recently, Kumar et al[KRR88] present several “distributed” formulations of 
priority queues based on binary heaps with relaxed strictness of priority. 

This thesis presents the design and experimental evaluation of different implemen- 
tations of concurrent priority queues. We present a novel concurrent priority queue 
mechanism based on the Fibonacci heap, which is theoretically the most efficient data 
structure for the sequential priority queue. This parallel Fibonacci heap provides oper- 
ations that are theoretically and practically more efficient than the concurrent binary 
heap. A concurrent access scheme for a doubly linked list is described as part of the 
Fibonacci heap implementation. We also describe a new concurrent priority queue, the 
concurrent priority pool, that is based on concurrent B-trees [WW90][LY81][LSS87] and 
concurrent pools [KE89][Man86]. As shown in Chapter 5, this scheme has the highest 
throughput among all concurrent priority queues studied here. The performance of dif- - 
ferent concurrent priority queues is analyzed using the language Mul-T[KHM89] on an 
Encore Multimaz shared memory multiprocessor. The performance indicates that both 
the parallel Fibonacci heap and the concurrent priority pool are linearly scalable and 
have larger throughput than the concurrent binary heap. The single source shortest path 
problem and the vertex cover problem are tested as applications of concurrent priority 


queues. 


1.1 Parallel Fibonacci Heap 


The parallel Fibonacci heap is based on the sequential Fibonacci heap, which is theoreti- 
cally the most efficient data structure for sequential priority queues. The critical sections 
acquired by the operations on the parallel Fibonacci heap are small and distributed over 
the entire data structure. Therefore, the parallel Fibonacci heap has low contention. The 
insert operation takes constant time, the decrease key operation takes constant amor- 
tized time, and the extract and delete operations take logarithmic time. This scheme 
provides more scalable operations and higher throughput than current schemes such as 
the concurrent binary heap. An algorithm for concurrent access to doubly linked lists is 


2This value depends on the length of the think time. Experimental results are shown in Chapter 5. 
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described as part of the implementation of parallel Fibonacci heaps. 


1.2 Concurrent Priority Pool 


Concurrent priority pools are based on concurrent B-trees and concurrent pools. Since 
the concurrent priority queue employs a distributed data structure(the pool), the insert 
and extract operations do not share critical resources in most cases. As shown in Chapter 
5, concurrent priority pools have the highest throughput among all concurrent priority 
queues investigated. Concurrent priority pools also allow tight control over the quality of 
extracted nodes. Insert operations run in logarithmic time, and extract operations take 


logarithmic time in the worst case. 


1.3 Experimental Environment 


I performed most of the experiments on two Encore shared memory multiprocessors. 
One of the Encore machines has 20 processors of which 18 processors can be used for 
running Mul-T. The concurrent priority queues were implemented in Mul-T, a Lisp-like 
programming language with futures and locking mechanisms. 


1.4 Overview 


Chapter 2 describes various implementations of sequential priority queues, such as binary 
heaps, binomial heaps, Fibonacci heaps, and B-trees. 

Chapter 3 presents the data structure and concurrent access algorithms for the parallel 
Fibonacci heap. The concurrent operations on a doubly linked list are described as part 
of the implementation. 

Chapter 4 presents the data structure of concurrent priority pools and concurrent 
operations on it. 

Chapter 5 gives an experimental analysis of different implementations of concurrent 
priority queues. 

Chapter 6 presents a summary of what has been accomplished and discusses some 
related research and directions for future research. 
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Chapter 2 


Preliminaries of Sequential Priority 


Queues 


A sequential priority queue is a data structure for maintaining a set of nodes, each with 
an associated key value, and other satellite data. In this thesis, a small key value has 
higher priority than a larger key value. A priority queue usually supports the following 


operations: 


1. Insert: insert a new node into the queue, 


2. Extract: delete the node from the queue whose key is minimum, returning a pointer 


to the node, 


3. Decrease key: make a node in the queue have a new key value which is assumed to 


be less than its current key value, 
4. Delete: delete a node from the queue, 


5. Union: create and return a new queue that contains all the nodes of the two input 


queues. 


There are many applications of priority queues. An example is scheduling processes on 
a shared computer, especially on large multiprocessor systems. The priority queue keeps 
track of the processes to be performed and their relative priorities. When a processor 
is idle, it selects a high priority process from the priority queue to work on. Another 
application is the state space search in many graph algorithms, such as Dijkstra’s single 
source shortest path algorithm(SSSP) and the vertex cover problem(VCP)[CLR90]. In 
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nae Binary es Binomial 3 F eee ou B-tree 
Operation = oe “oe case) 

INSERT O(lg n) 

EXTRACT O(lg n) 


DECREASE O(lg n) 
DELETE O(lg n) 
UNION Not well supported 


Table 2.1: Time bounds of operations on different sequential priority queue implementa- 
tions 


the SSSP algorithm, a priority queue is used to monitor the distance of each vertex from 
the source, and the algorithm always explores the “closest” vertex first. In the VCP, we 
use a priority queue to keep track of the state-space search graph. 

This chapter discusses different implementations of sequential priority queues, binary 
heaps, binomial heaps, Fibonacci heaps, and B-trees. We adopt the notation from the 
book Introduction to Algorithms{CLR90]. Table 2.1 shows the running times for opera- 
tions on these four implementations of priority queues. The number of nodes in the heap 


at the time of an operation is denoted by n. 


2.1 Binary Heap 


2.1.1 Data Structure 


The binary heap can be viewed as a complete binary tree, as shown in Figure 2.1(a), each 
node of which has a key. The heap satisfies the heap property: the value of a node is 
at least as big as the value of its parent. Thus, the node with the smallest key in a heap 
is stored at the root, and the subtrees rooted at a node contain larger values than the 
node. The tree is completely filled on all levels except possibly the bottom level, which 
is completely filled from the left up to a point. 

Before presenting the access schemes for a binary heap, we first briefly describe an 
efficient representation of a binary heap using an array, as shown in Figure 2.1(b). Each 


node of the tree corresponds to an element of the array. The root occupies location 1 
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Insertion path @ 


(a) 


Se AFIEARAIe 9 10 11 


(b) 


Figure 2.1: A binary heap (a) viewed as a binary tree (b) represented as an array. The 
number within the circle representing a node in the tree is the value stored at that node. 
The number next to a node is the corresponding index in the array. 


and node z occupies location 7. The left child of node 1, LCHILD(t), occupies location 
22 and its right child, RCHILD(t), occupies location 2: + 1. The parent of node : is at 
[4]. Associated with the heap are data fields lastelem and fulllevel, in which lastelem 
is the index of the last non-empty node of the heap and fulllevel is the index of the 
first node at the bottom level of the heap that contains at least one non-empty node. 
For an empty heap, lastelem = fulllevel = 0. An empty node has a special key called 
MAXINT whose value is oo. Figure 2.1 shows a heap with 11 keys, and the values of 
lastelem and fulllevel. 


2.1.2 Operations on a Binary Heap 


The operations usually performed on a binary heap are insertion and extraction. Here 
we show the algorithms [RK88a] for doing insertions and deletions; both proceed from 
the root to the bottom of a binary tree. 
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The insert operation adds a node into the binary heap. Let target be the first empty 
node in the heap; this will be the last non-empty node after the insertion. The insertion 
path is the path between the root and target. Figure 2.1(a) shows a ten node heap, to 
which the eleventh node is being added. The insertion path can be traversed starting 
from the root as follows. Let I be the displacement of target at the bottom level(i.e., 
I = lastelem — fulllevel) and P be the length of the insertion path. If we view I as a P 
bit binary number, the bits of the binary representation of J (from the most significant 
to the least significant) tell us whether to go right (if 1) or left (if 0) when we go from the 
root downward. In the example in Figure 2.1(a), fulllevel = 8, target = 11, so ] =3 = 
(011) in binary representation. This means that we can go from the root to the target 
by following left, right, and right branches at successive levels. The algorithm is given in 
Figure 2.2. 

Figure 2.3 shows the pseudocode for the delete operation. It removes the root of the 
heap and places the key of the last non-empty node of the heap at the root. The heap 
property may now be violated at the root of the heap. Reheapification is performed by 
repeatedly pushing this key downward until the heap property is satisfied. 

Since a heap of n nodes is based on a complete binary tree, its height is O(Ig n). The 
insert and extract operations run in time at most proportional to the height of the tree; 


thus, these operations take O(lg n) time. 


2.2 Fibonacci Heap 


Fibonacci heaps were introduced by Fredman and Tarjan{[FT87]. The Fibonacci heap 
has the best amortized time bound for all operations among the implementations listed 
in Table 2.1. From a theoretical point of view, Fibonacci heaps are especially desirable 
when the number of extract-min and delete operations is small relative to the number of 
other operations performed. This situation arises in many applications, such as comput- 
ing minimum spanning trees[(CLR90] and Dijkstra’s algorithm for finding single source 
shortest paths[CLR90]. From a practical standpoint, the Fibonacci heap is generally 
regarded as being only of theoretical interest because of its code complexity and con- 
stant overhead. However, for parallel applications, the time spent on acquiring critical 
resources, like locking and waiting, can be dominant over the constant overhead. In 
fact, the experimental results in chapter 5 show that the parallel Fibonacci heap is more 
scalable and efficient than the concurrent binary heap whose code is much shorter. We 
first examine a simpler data structure, the binomial heap, which is the basis for the Fi- 
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proc insert(heap, nkey) 
% insert a new nkey into heap 


lastelem := lastelem + 1 

target := lastelem 

if (lastelem > fulllevel*2) then 
fulllevel := lastelem 


end 

i := target — fulllevel % i is the displacement of target 
j = fulllevel/2 % 7 = Qlength of insertion path -— 1 

p:= 1 % p is the current position in the insertion path 


% Reheapification loop 
while (j 4 0) 
if (key[p] > nkey) then 
Exchange(nkey, key[p]) 


end 
if (i > j) then 
p := rchild(p) 
i:=i-j 
else 
p := Ichild(p) 
end . 
j= j/2 
end 
key[p] := nkey 


22 end _ insert 


bonacci heap. We then present an analysis of the data structure and the operations on 


Figure 2.2: Insert operation on binary heap 


the Fibonacci heap. 


2.2.1 Binomial Heap 


A binomial heap is a collection of binomial trees. The binomial tree B, is defined 
recursively. The binomial tree Bo consists of a single node. The binomial tree B, consists 
of two binomial trees B,_; that are linked together: the root of one tree is the leftmost 
child of the root of the other. The binomial tree B, has the following properties, 


1. There are 2* nodes, 


2. The height of the tree is k, 
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proc delete(heap) 


1 if (lastelem = 0) then 

2 return nil 

3 end 

4 least := key[1] % root resides at the location 1 of the array 
5 i:= 1 

6 j := lastelem 

7 lastelem := lastelem — 1 

8 if (lastelem < fulllevel) then 
9 fulllevel := fulllevel/2 
10 end 

ll if G = 1) 

12 key(1] := MAXINT 

13 return least 

14 end 


15 key[i] := key{j] 
16 key{j] = MAXINT 


% Reheapification loop 
% let min—son(i) be the index of the son of i which has smaller key 
17. while (key[i] > key[min—son(i)]) do 


18 Exchange(key[i], key[min—son(i)]) 
19 i := min-son(i) 
20 end 


21 + return least 
22 end delete 


Figure 2.3: Delete operation on binary heap 


3. The root has degree k, which is greater than that of any other node; if the children 
of the root are numbered from left to right by k —1,k —2,...,0, child z is the root 


of a subtree B;. 


A binomial heap & is a set of binomial trees that satisfies the following binomial- 


heap properties. 


1. Each binomial tree in A is heap-ordered: the key of a node is greater than or 


equal to the key of its parent. 
2. There is at most one binomial tree in h whose root has a given degree. 


The first property tells us that the root of a heap-ordered tree contains the smallest key 
in the tree. The second property implies that an n-node binomial heap h consists of at 


most {lg n| + 1 binomial trees. 
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The insert operation on binomial heaps creates a new tree on its own of degree 0. 
This may now violate the bionomial heap property 2 above, since there may be another 
tree of degree 0. If there is another tree of degree 0, the two degree 0 trees are merged 
into a single tree of degree 1 by making one tree a child of the other according to the 
heap-order rule (i.e., the root of the tree with the larger key is made a child of the root 
of the tree with the smaller key). This may again violate the bionomial heap property; if 
so, we continue merging in recursive fashion. Thus, the insertion operation runs in time 
at most proportional to the number of binomial trees, which is O(/g n). 

The extract operation is very similar to the insert operation, and also takes time 
O(lg n). The worst-case time bounds for the binomial heap are shown in Table 2.1. In 
particular, the Union operation takes only O(lg n) time to merge two binomial heaps 
with a total of n elements, which is better than the O(n) time for the binary heap. 


2.2.2 Structure of Fibonacci Heap 


Like a binomial heap, a Fibonacci heap is a collection of trees. However, a Fibonacci 
heap is a more “relaxed” data structure than a binomial heap: the trees in a Fibonacci 
heap are not constrained to be as those in a binomial heap, in that there may be many 
trees of a given degree as opposed to only one for a given degree in a binomial heap. 
Furthermore, an interior node of a tree may lose at most one child after it becomes an 
interior node and a root node may lose multiple children. This more relaxed structure 
allows for improved operation time bounds by delaying work that maintains the structure 
until it is convenient to perform. 

As Figure 2.4 shows, a Fibonacci heap is a collection of trees whose roots are linked 
in a circular, doubly linked list called the root list; the heap is accessed through a min 
pointer to the root of the tree containing a minimum key. An empty heap has a nil min 
pointer. Each node z in a tree contains a pointer p(z] to its parent and a pointer child[z] 
to any one of its children. The children of x are linked together in a circular, doubly 
linked list called the child list of x. Each child y in a child list has pointers le ft[y] and 
rightly] that point to y’s left and right siblings, respectively. The number of children in 
the child list of node z is stored in degree[z]. The boolean-valued field mark[z] indicates 
whether node z has lost a child since the last time z was made the child of another node. 
The mark field is used only in decrease and delete operations. 

Circular, doubly linked lists(DLL) have two advantages for use in Fibonacci heaps. 
First, we can remove a node from a circular, doubly linked list in O(1) time. Second, 
given two such lists, we can concatenate them into one circular, doubly linked list in O(1) 
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Figure 2.4: An example of Fibonacci heap 


time. I have designed a parallel access scheme for DLL, described in section 3.2, that 
preserves the above two advantages. 


2.2.3. Insert Operation 


To insert a node into a Fibonacci heap, we only need to insert the node into the root list 
of the heap and return a pointer to it. If the heap was empty, or the newly inserted node 
has a smaller key than that of the minimum node, min is changed to point to the new 
node. The insertion only takes constant time compared to O(lg n) in the binary heap 
and the binomial heap. Figure 2.5 shows the pseudo code for the insert operation. 


2.2.4 Extract Operation 


The process of extracting the minimum node consists of two steps. The first step, finding 
the minimum node and removing it from the heap, is not hard, since we have the min 
pointer to the minimum node. The pseudo code for extracting the minimum node is 
shown in Figure 2.6. 

In the second step, as shown in Figure 2.7, we reduce the number of trees in the 
Fibonacci heap and find a new minimum node by consolidating the root list of the 
Fibonacci heap. Consolidating the root list consists of repeatedly executing the following 
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proc insert(h, x) 
% insert new node z into heap A 


Initialize node x by updating its degree, p, child, 
left, right, and mark fields properly 
Put x into root list of h 
if (minfh] = nil) or (key[x] < key[min[h]]) then 
min{h] := x 


ao m & b= 


end 


Figure 2.5: Insert operation of Fibonacci heap 


steps until every root in the root list has a distinct degree value. 


1. Find two roots z and y in the root list with the same degree, where key[z] < key[y]. 


2. Link y to z: remove y from the root list, and make y a child of z. 


In lines 16-23, the consolidation process finds the current minimum node in the root 
list. The amortized time taken by the extract operation is O(/g n). 


2.2.5 Decrease Key Operation 


The decrease key operation for a Fibonacci heap is shown in Figure 2.8. To decrease the 
key of node z to a value k, we first replace z’s key with k in lines 1-4. If the heap-order 
is violated(i.e., k < key[y] where y is the parent of z), we cut z from y in line 7, and 
make z a root. From the Fibonacci heap constraints, an interior node can only lose one 
child; further cascading cuts are performed at line 8 to satisfy this constraint. The 
amortized cost of the decrease key operation is O(1). 


2.2.6 Delete Operation 


Deleting a node z from a Fibonacci heap can be viewed as making node z the minimum 
node in the heap by decreasing its key to —oo, then removing node z from the Fibonacci 
heap with the extract operation; this is shown in Figure 2.9. 

The amortized time of delete is the sum of the O(1) amortized time of decrease key 


and the O(lg n) amortized time of extract. 
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proc extract(h) 


1 z := minfh] 
2 if (2 # nil) then 
3 for each child x of z do 
4 add x to the root list of h 
5 p(x] := nil 
6 remove z from the root list of h 
7 if (z = right[z]) then 
% z is the only node in the heap 
8 min{h] := nil 
9 else 
10 min{h] := right[z] 
% consolidate the heap and find nezt min 
11 consolidate(h) 
12 end 
13. end 


14. end extract 


Figure 2.6: Extract operation of Fibonacci heap 


The delete operation could be improved by directly removing the node from the heap 
instead of first putting it into root list and then taking it out. However, the amortized 


time bound would not improve. 


2.3. B-Tree 


B-trees [BS77][Com79] are balanced search trees designed to work well on magnetic disks 
or other direct-access secondary storage devices. The guaranteed small search, insertion, 
and deletion time of B-trees makes them quite appealing for database applications. Nev- 
ertheless, we will see later on that the B*-tree{MR85], a variant of the B-tree, could 
also serve as a priority queue. In this section, we briefly describe the Bt-tree that is 
well suited for use in a concurrent database system. More information can be found in 
[CLR90][LY81][Wed74]. For simplicity, we denote B+-tree as B-tree in this thesis. 
Figure 2.10 shows an example of B-tree internal and leaf nodes. A B-tree has the 


following major properties: 
1. Each path from the root to any leaf has the same length, h. 


2. Each node contains at most 2k + 1 elements, in which k is a tree parameter. Each 
node contains at least one element. There are other variations of B-trees that 
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proc consolidate(h) 


% initialize an array for compacting trees with the same degree 
for i := 0 to DEGREE—UPPER-BOUND do 
Afi] := nil 
% compact the trees with the same degree 
for each node w in the root list of h do 
x= WwW 
d := degree[x] 
while (A[d] # nil) do 
y := Afd] 
if (key[x] > key[y]) then 
Exchange(x, y) 


link(h, y, x) 
A[d] := nil 
d:=- d+1 
end 
end 
A[d] := x 


% find the nezt node with the minimum key 
min(h] := nil 
for i := 0 to DEGREE-—UPPER-BOUND do 
if (A[i] 4 nil) then 
add A[i] to the root list of h 
if (minfh] = nil) or (key{A[i]] < key{minfh]]) then 
min{h] := Aji] 
end 
end 
end consolidate 


proc link(h, y, x) 
remove y from the root list of h 
make y a child of x, incrementing degree[x] 


mark[y] := false 
end link 


Figure 2.7: Consolidate operation of Fibonacci heap 
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proc decrease(h, x, k) 
% decrease the key of x to k 


if (k > key[x]) then 
error “new key is greater than current key” 
end 
key[x] := k 
y = pix] 
if (y # nil) and (key[x] < keyly]) then 
% the heap order is violated 
cut(h, x, y) 
cascading—cut(h, y) 


end 

if (key[x] < key[min[h]]) then 
min{h] := x 

end 


end decrease 


proc cut(h, x, y) 


Remove x from the child list of y, decreasing degree[y] 
Add x to the root list of h 

pix] == nil 

mark[x] := false 

end cut 


proc cascading—cut(h, y) 


z := ply] 
if (2 # nil) then 
if (mark[y] = false) then 
% y has lost one child 
mark[y] := true 
else 
% y has lost two children 
cut(h, y, 2) 
cascading—cut(h, z) 
end 
end 
end cascading—cut 


Figure 2.8: Decrease operation of Fibonacci heap 
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proc delete(h, x) 
% delete node z from Fibonacci heap z 


1 decrease(h, x, —oo) 
extract(h) 
3 end delete 


i) 


Figure 2.9: Delete operation of Fibonacci heap 


(a) B-tree internal node 


(b) B-tree leaf node 


Figure 2.10: Structure of B-tree 


require each node to contain at least k + 1 elements. 


3. The keys of all of the data in the B-tree are stored in the leaf nodes. Nonleaf nodes 
contain pointers and the key values to be used in following those pointers. 


4. Within each node, the keys are in ascending order. 


5. In nonleaf nodes, each pointer, P;, points to a subtree JT; whose root is the node 
that P; points to. The values stored in T; are bounded by the two key values, K; 
and Kj41, to the “left” and “right” of P, in the node(i.e., the set of values stored 
in subtree T; is bounded by K; < v < Ki41). 


B-trees have internal nodes that look like those shown in Figure 2.10(a). The A; are 
instances of the key domain, and the P; are pointers to other nodes. On the leaf level, 
B-tree nodes, as shown in Figure 2.10(b), contain keys and other information associated 
with them. 
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To insert a new node with key newkey into the B-tree, we start from the B-tree 
root and move downwards from each nonleaf level following the pointer P; that has two 
neighbors K; and K;j4, satisfying K; < newkey < Kj,,;. When a leaf is found, newkey 
is inserted if there is room; otherwise, the leaf is split, and the split may propagate back 
up the tree. 

The delete operation first locates the leaf that stores the key oldkey to be deleted. 
The locating process is just like that in the insert operation. Once the leaf is found, 
oldkey is removed from it. If the leaf is then empty, it is merged with its neighbor, and 
the merge may propagate back up the tree. 

To use a B-tree as a priority queue, the insert operation remains the same; the extract 
operation is implemented by deleting the smallest key from the leftmost leaf of the B-tree. 
In fact, if we maintain a direct pointer to the leftmost leaf of the B-tree, we can avoid 
the locating process used in the delete operation. 

The insert operation takes time proportional to the height of the B-tree, O(lg n), 
where n is the number of keys stored in the tree, and the extract operation takes time 


O(lg n) including merging leaves and internal nodes. 
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Chapter 3 


Parallel Fibonacci Heap and 
Concurrent Access Algorithms 


In this chapter, we present our design for a parallel Fibonacci heap that is based on the 
sequential Fibonacci heap described in Chapter 2. The parallel Fibonacci heap maintains 
the advantages of its sequential counterpart, i.e., its asymptotically more efficient oper- 
ations, and it also has linearly scalable throughput as shown in Chapter 5. The parallel 
Fibonacci heap reduces contention by weakening the semantics of the extract operation: 
an extract operation need not return the minimum element in the heap, instead it can 
return a promising element close to the minimum where the promising quality can be 
controlled. The non-strict semantics of the extract operation for the parallel Fibonacci 
heap is elaborated in Section 3.1. Section 3.2 presents a concurrent access algorithm for a 
doubly linked list. Section 3.3 gives a description of the data structure of the parallel Fi- 
bonacci heap. The concurrent access algorithms are presented in Section 3.4. Section 3.5 


summarizes this chapter. 


3.1 Semantics of Parallel Fibonacci Heap 


The semantics of the insert, decrease, and delete operations on a parallel Fibonacci heap 
remain the same as on a sequential Fibonacci heap presented in Section 2.2, but the 
semantics of the eztract operation are non-strict. The sequential Fibonacci heap has a 
strict extract operation in the sense that it always extracts the minimum node from the 
heap. However, for parallel Fibonacci heaps, since there are potentially many processes 


extracting nodes concurrently, strict semantics are undesirable for two reasons: 
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e In terms of correctness, strict semantics are not required in most, if not all, parallel 
priority queue applications. However, it is usually desirable to control the quality 
of extracted nodes to meet applications’ requirement. The strict extract opera- 
tion usually involves more contention, and doesn’t extract more promising nodes 
overall. For example, suppose there are 5 processes concurrently trying to extract 
nodes from a priority queue that contains 5 highest priority nodes nl, n2, n3, n4, 
and nd. In the case of strict semantics, the extract operations have to be serialized 
and get nl to n5 one at a time. This creates a bottleneck. If we adopt non-strict 
semantics, we potentially can extract nl to n5 concurrently without blocking, and 
the extracted nodes n1 to n5 will be the same as those extracted with strict seman- 
tics, although the order in which they are extracted may differ. The concurrent 
access algorithms presented in Section 3.4 provide methods to control the promising 
extent of extracted nodes. 


e Realizing strict semantics for parallel implementations is expensive, since we have to 
linearize all operations; this creates severe bottlenecks. There is a tradeoff between 
strictness and contention. The stricter the semantics, the greater the contention 
on a priority queue. The experiments in Chapter 5 show that a strict scheme for a 
concurrent binary heap saturates when the number of processes is more than about 


eight. 


Instead of having a min pointer to the minimum node in the heap, our parallel 
Fibonacci heap has a promising list that is an array of pointers to some promising nodes 
in the root list. We will look into the extract operation in section 3.4. 


3.2 Concurrent Operations on a Doubly Linked List 


A doubly linked list(DLL) is a data structure in which the objects are arranged in 
linear order and every object has a key field and two other fields: left and right. Given an 
object zin a doubly linked list, right/z/ points to its successor in the list, and /eft/z] points 
to its predecessor. The insert and delete operations take only constant time provided that 
we know where to insert an object and which object to delete. Searching an n-object list 
takes O(n) time. 

Concurrent insert and delete operations are more complicated than their sequential 
counterparts. Let’s consider concurrent insertion, concurrent deletion, and concurrent 


insertion and deletion separately. 
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Figure 3.1: Concurrent insertion on DLL 


3.2.1 Concurrent Insertion on DLL 


Inserting a node N into DLL LIST, as shown in Figure 3.1(a), takes two steps: 


1. Find two neighbor nodes L and R in LIST to insert N between. 


2. Modify the right field of Z, the left and right fields of N, and the left field of R. 


In the second step, we have to ensure that the fields are updated atomically. Doing 
so involves locking certain fields in some nodes (e.g., the right field of L). However, this 
could cause a bottleneck if there are many processes trying to insert new nodes between 
L and R, as they all have to lock the right field of Z during insertion. Thus, it would 
be better to spread out insertions among the nodes in LIST, preferably as evenly as 
possible. One way to do this is to place a set of dummy nodes in LIST, as shown in 
Figure 3.1(b). Dummy nodes are similar to normal nodes in the DLL, except they are 
marked dummy, can be accessed directly’, and remain in the DLL all the time. We define 


1For example, we can have an array of pointers to the dummy nodes so that they can be accessed 
directly from the array. 
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Figure 3.2: A scenario of concurrent deleting N and R of DLL without locking them 


a section of DLL to be the sub-DLL between two dummy nodes as shown in Figure 3.1(b). 
The insert operation on LIST is now the following: 


1. Randomly choose a dummy node D. If D’s right field is locked, we can try another 
dummy node; otherwise, lock D’s right field. 


2. Insert the new node to the right of D, and update the right field of D, the left and 
right fields of the newly inserted node, and the left field of D’s old right neighbor. 


The number of dummy nodes needed in LJ ST depends on the access frequency and 
applications. We will see in the following section that the dummy nodes also help the 


delete operation. 


3.2.2 Concurrent Deletion on DLL 


Deleting node N from its two neighbors L and R, as shown in Figure 3.1(a), changes 
the right field of Z and the left field of R. The left and right fields of N may also need 
to be changed. The right field of Z and the left field of R have to be locked for proper 
deletion. Moreover, the left and right fields of N must be locked too. Otherwise, the 
following scenario may arise when deleting N and R concurrently, as shown Figure 3.2, 
which results in a broken list. 
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Delete N Delete R 

Lock right[LZ] Lock right[N] 

Lock le ft[R] Lock left[J] 

Set right(L] pointing to R | Set right[N] pointing to J 
Set le ft[R] pointing to L | Set left[J] pointing to N 
Clear le ft[N], right[N] Clear le ft[R], right[R] 


To avoid deadlock, we lock the fields in a particular order: first lock the right field of 
L, then the left and right fields of N, finally the left field of R. We could still deadlock 
if we did not have dummy nodes in the LIST. One example is to delete the only node 
N in a circular DLL. In this case, N itself is both its left and right neighbor, which will 
cause the locking process, described above, to deadlock. This problem could be avoided 
by keeping track of the number of nodes in the circular DLL, and treating deletion of 
the only node in a circular DLL as a special case. However, there is another situation 
that is similar to the dining philosophers problem and that can’t be gracefully avoided 
without dummy nodes. Suppose there are n nodes in the circular DLL LIST and n 
processes deleting nodes concurrently in a conspired way: each process is deleting a 
different node, and each process is executing the locking process synchronously. This will 
create a circular locking chain. Dummy nodes will prevent this form of deadlock chain. 

Dummy nodes are not sufficient to prevent all locking problems. Consider the follow- 
ing situation: while deleting N, we have to lock the right field of L. We find L by using 
left{N]. But at the time of the lookup, left[N] has not been locked, which means the 
field may be changed by another process. Although this problem can be overcome by 
using complex locking methods, the method described below using scavenger processes 


seems simpler and more elegant. 


3.2.3 Concurrent Insertion and Deletion on DLL 


The complexity of parallel operations on this relatively simple data structure is caused by 
allowing the concurrent removal of nodes from the list. We can get better performance if 
we disallow concurrent removals in the following way: deleting N only marks N as dead, 
and all dead nodes are actually removed from the DLL by scavenger process(es), which 
run as background or periodic foreground processes. Each scavenger process locks one 
section, and removes dead nodes from that section. Since the DLL is nicely divided by 
the dummy nodes into sections, we avoid deadlock and interference problems by allowing 
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proc insert(obj, dil) 
% Insert an obj into doubly linked list dll 


1 Randomly find a unlocked dummy node d in dll, and lock right/d] 
2 Insert obj to the right of d 
3 Unlock d 
4 end insert 
proc delete(obj, dll) 
% Delete obj from doubly linked list dil 
1 Mark obj to be dead 
2 Occasionally do 
3 Randomly find a unlocked section s and lock it 
4 for every obj in s do 
5 if (obj is not the right neighbor of a dummy node) and 
6 (obj is marked dead) then 
7 remove obj from dll 
8 end 
9 unlock s 
10 end delete 


Figure 3.3: Concurrent operations on doubly linked list 


at most one scavenger process to operate on each section. This kind of distributed 
scavenging method alleviates the complex locking problem described in the last section. 

Figure 3.3 gives the pseudocode for concurrent operations on a DLL. The insertion 
operation is the same as that described in Section 3.2.1. The delete operation occasionally 
locks a section, and removes dead nodes in it. With the help of dummy nodes, there is 
not much contention on the DLL. The insert and delete operations on a DLL still take 


constant time. 


3.3. Data Structure of Parallel Fibonacci Heap 


A parallel Fibonacci heap, as shown in Figure 3.4, is a collection of trees whose roots 
are linked in a circular DLL with dummy nodes as described in Section 3.2. Instead 
of having one min pointer to the root of the tree containing a minimum key, there 
is an array of pointers to the roots of the trees having promising keys. The array is 
called the promising list. For convenience, we use “node in promising list” to mean 
“node pointed to by some pointer in the promising list” in this thesis. There is a lock 
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Figure 3.4: Structure of parallel Fibonacci heap 


associated with each pointer in the promising list. The size of the promising list, mazpt, 
is a parameter that can be controlled in the algorithm. Besides having the fields of their 
sequential counterparts, such as left, right, parent, child, key, degree, and mark, the 
nodes in a parallel Fibonacci heap have some synchronization fields — there are three 
locks associated with the left, right, and key fields of a node, respectively. In addition, the 
mark of a node can be one of dummy, dead, promising, unmarked, and marked. Dummy 
means the node is a dummy node as described in section 3.2, dead means the node has 
been deleted, promising means the node is a promising node, and unmarked and marked 
are used in the same way as in the sequential algorithms to denote whether the node 
has lost a child since it became an interior node. As in the DLL, a section of a parallel 
Fibonacci heap contains the trees between two dummy nodes as shown in Figure 3.4. 


3.4 Concurrent Access Algorithms 


In this section, the concurrent access algorithms for the parallel Fibonacci heap are 
presented. In these algorithms, we use a method to minimize blocking time and en- 
hance throughput called the check-lock-verify method. The check-lock-verify method is 
a high-level, efficient, non-blocking test&do atomic operation, which is described as the 
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“cheating” method in [Bir89]. Here is a comparison of test&do and check-lock-verify: 
check-lock-verify: test&do 


If (conditions are met) then 


Lock critical section Lock critical section 
If (verify conditions are met) then | If (test conditions are met) then 
do things in critical section do things in critical section 
else else 
Unlock and exit Unlock and exit 
endif endif 
endif 


The check-lock-verify method asynchronously checks conditions before entering the 
critical section, while test&do enters the critical section first. In this way, the check-lock- 
verify method avoids some possible blocking time on the critical section, if the conditions 
are not met. However, the semantics of the check-lock-verify method are different from 
those of test&do in the sense that the latter is stricter. Test&do guarantees that the con- 
ditions are checked inside a critical section, while the check-lock-verify method first checks 
the conditions outside the critical section. Only when the conditions can be correctly 
atomically read?, are the semantics of test&do and check-lock-verify the same. There 
are many places in the algorithm where the check-lock-verify method can be used. The 
check-lock-verify method makes programs look more complex and harder to understand, 
thus, it is normally not included in the pseudocode listings presented in this section. 


3.4.1 Insert Operation 


As shown in Figure 3.5, inserting a new key k into a parallel Fibonacci heap h is very 
similar to inserting a key into a DLL. First a new heap node n is created with key k, and 
the other fields are properly set. In lines 2-5, we randomly find a dummy node D in the 
root list, lock the right field of D, and insert the new node to the right of D. Actually, 
if we find that right[D] has already been locked while trying to lock it at line 3, another 
dummy node can also be tried. The insert operation ensures that all nodes are inserted 
evenly among the dummy nodes in the root list. 

In lines 6-8, we check whether the newly inserted node n with key k is promising; this 
is similar to checking whether the newly inserted node is better than mzn in the insert 


?These features are often machine dependent. The programmer should always check these features 
before taking advantages of them. 
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proc insert(id, h, k) 
% Insert key k into parallel Fibonacci heap h. id is an issued worker id 


Initialize a new node n with key k 
Randomly choose a dummy node D in the root list 
Lock right[{D] 
Put n to the right of D 
Unlock right[D] 
if good(id, h, k) then 
check—promising(h, n) 
end 
return n 
end insert 


SCOMANAN HOH 
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proc good(id, h, k) 
% a heuristic function that tests whether k has a good chance to be promising 


if (k > last—extract[id] * strictness) then 
return NO 

else 
return YES 

end 

end good 


Aon Pw D Ee 


Figure 3.5: Insert operation on parallel Fibonacci heap 


operation on a sequential Fibonacci heap. In order to avoid checking some “obvious” non- 
promising nodes, a heuristic function good is designed to filter out most non-promising 
nodes. If the heuristic function says k is good, then we actually check whether node n is 
promising, as presented in the next section; otherwise, the node n still has a chance to 
be put into the promising list by the consolidation process described in section 3.4.4. 

I have designed a simple “distributed” heuristic function as shown in Figure 3.5. 
Suppose there is a fixed number of workers doing operations concurrently on the parallel 
Fibonacci heap (see chapter 5); each worker is assigned an id to distinguish it from 
the others. If a given application doesn’t fit this worker model, we can still map the 
operations performed by the application on the parallel Fibonacci heap to some number 
of virtual workers. Worker id keeps track of the key of the node it most recently extracted 
in last — eztract[id]; this is used as a rough measure of whether a key k is good or 
not. If & is greater than last — ertract|id] x strictness, in which strictness is a tunable 
parameter(usually set to be around 1), then k is not treated as good. The heuristic 
function gives real promising nodes a chance to bypass the consolidation process and 
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proc check—promising(h, n) 
% Check if node n is more promising than any already promising node prom—one, then 


% replace prom—one with n in the promising list. 

1 for every pointer prom—pt in promising list do 

2 Lock prom—pt % if prom—pt is locked, we can try next 
3 if prom—pt = nil then 

4 Lock key(n] 

5 if (mark[n] # dead) and (mark[n] # promising) 
6 and (parent{[n] = nil) then 

7 mark(n] := promising 

8 prom—pt := &n 

9 Unlock key(n] 

10 Unlock prom—pt 

11 return YES 

12 end 

13 Unlock key/n] 

14 Unlock prom—pt 

15 else 

16 prom—one := *prom—pt 

17 Lock key{prom—one] 

18 Lock key[n] 

19 if ((mark[{n] # dead) 
20 and (mark(n] # promising) 
21 and (parent([n] = nil) 
22 and ((mark[prom—one] = dead) 

23 or ((mark[prom—one] = promising) 
24 and (key[prom—one] > key[n])))) then 
25 mark[n] := promising 
26 if mark[prom—one] = promising then 
27 mark[prom—one] := unmarked 
28 end 

29 prom—pt := &n 

30 Unlock key{n] 

31 Unlock key[prom—one] 

32 Unlock prom—pt 

33 return YES 

34 end 

35 Unlock key[n] 

36 Unlock key{prom—one] 

37 Unlock prom—pt 

38 end 


39 return NO 
40 end check—promising 


Figure 3.6: Check whether a node is promising in parallel Fibonacci heap 
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directly be put into the promising list. We can tune strictness to control the quality of 
nodes in the promising list. The smaller the value of strictness, the better the nodes in 
the promising list, and possibly the longer it takes to find a promising node. So, there is a 
tradeoff here between strictness and contention on the queue. The experiments described 
in Chapter 5 show how the throughput varies with strictness. Moreover, strictness can be 
made adaptive depending on the feedback of check-promising: if check-promising always 
returns yes which means the heuristic function may be too strict, then strictness can be 
loosened to some degree; if check-promising always returns no, which means the heuristic 
function may be too loose, then strictness can be tightened a bit. 


3.4.2. Check-Promising 


Figure 3.6 shows how to check if node n is more promising than one of the already 
promising nodes in a parallel Fibonacci heap h. Basically, n is compared with every node 
in the promising list; if a nil pointer in the promising list or a promising node with key 
larger than key[n] is found, then n is put in the promising list; otherwise n is simply 
left in the root list. In the pseudocode, lines 1-2 loop over all pointers in the promising 
list, and try to lock each one before checking. In fact, if the pointer prom-pt is found 
already locked in line 2, we can try other pointers in the promising list. If prom-pt is a 
nil pointer, lines 4-14 check if n is not dead or promising and n is in the root list, then 
put n into the promising list by changing prom-pt to point to n. If prom-pt is not nil, 
lines 16-39 test if n is more promising than node prom-one pointed by prom-pt, then 
replace prom-one with n. Lines 19-21, like lines 5-6, make sure that n is not dead, is not 
already promising, and is in the root list before making it promising. 

The check-promising procedure is non-blocking in the sense that it does not block 
on a locked pointer in the promising list; instead it always tries to find a free promising 
pointer to lock. Also, since the heuristic function good filters out most non-promising 
nodes from being checked, there should not be much contention on the promising list. 
The time taken to check whether a node is promising is constant, O(mazpt). 


3.4.3. Extract Operation 


Figure 3.7 shows how to extract a node from a parallel Fibonacci heap h. Since we already 
have the promising list, if it is not empty then we can randomly remove a promising 
node from it; otherwise, we find several promising nodes to put in the promising list by 
consolidating a section of the heap, and retry the extract operation. 
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Line 1 randomly chooses a pointer prom-pt from the promising list. Then we try 
to lock prom-pt in line 6; if it has already been locked, we try another pointer in the 
promising list. Line 7 checks if prom-pt is nil; if it is, we pick up another pointer from 
the promising list and repeat the process of locking and checking prom-pt. Otherwise, we 
lock the node prom-one pointed to by prom-pt. If prom-one is indeed a promising node, 
we put its children, if any, into the root list, and take prom-one out of the promising 
list by marking it dead in lines 14-21. If prom-one is not promising, we simply try other 
pointers in the promising list in lines 23-25. If after trying “enough” times, we still fail 
to find a promising node, then it is time to consolidate the heap in lines 3-5; that will 
compact trees together, and find some promising nodes to put in the promising list. 

The promising list is implemented as an array in which each pointer can be directly 
accessed, and the size of the promising list can be controlled*® to reduce contention. The 
extract operation never blocks on a locked pointer in the promising list; therefore, we 
do not expect much contention on grabbing a pointer from the promising list. The time 
taken to extract a promising node is constant, if we successfully find a promising node 
in the promising list. Otherwise, the extract time is the time spent consolidating a 
section of the parallel Fibonacci heap. This, we will see in next section, is logarithmic 
in the number of nodes in that section. Thus, the time taken to do extract operation is 
O(lg |section|). 


3.4.4 Consolidate the Parallel Fibonacci Heap 


When a process performing an extract operation cannot find a promising node in the 
promising list after some number of probes, it consolidates the heap, actually a section 
of the heap, as described in Figure 3.8. The consolidate process randomly chooses a 
section that is not already being consolidated by another process and locks the section. 
The process then walks through the nodes in the root list of the section. If a root 
node is marked as dead, we remove it in lines 10-14. Since there is always at most one 
consolidation process in a section, there is at most one removal operation running in a 
section, so we don’t have to lock a dead node’s neighbors while removing it from the 
DLL. When a dead node and a dummy node are neighbors, between which there may be 
insertions going on, we just choose not to remove the dead node. 

The consolidation process keeps track of several good nodes that are not already in 
the promising list by comparing all the non-promising and non-dead nodes in the root 


3The size is usually chosen to be the number of processes accessing the heap. 
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list of the section, and puts them in buffer B. We can then “flood” B into the promising 
list by running check-promising on all the nodes in B after finishing the walk through the 
section in lines 19-20. Buffer B is implemented as a sorted array of fixed size, buffersize. 
A smaller buffersize means the nodes in the buffer tend to be more promising. We show 
the results of experiments that vary buffersize in Chapter 5. 

The consolidate process also performs normal consolidation like its sequential coun- 
terpart. It merges trees of the same degree to reduce the number of trees in the root 
list. If the root node of a tree is dead or promising, then it won’t be merged with other 
trees. When merging two trees rooted at z and y respectively, we have to lock key[z] and 
key[y] first. The reason for locking is that there may be delete and decrease operations 
going on that will interfere with the consolidate process. 

The consolidation time for the parallel Fibonacci heap is basically the same as the time 
taken for the sequential consolidation, because there is only one consolidation process in 
each section, and the consolidation process only does a little more work than its sequential 
counterpart: it finds more promising candidates (buffersize per process), and there are 
some locks required when merging trees. These locks are used to prevent operations like 
delete and decrease key from getting in. The delete and decrease key operations can be 
operated on all nodes in the Fibonacci heap, not just nodes in the root list. In fact, most 
of these operations, like deleting some non-promising nodes and decreasing keys of some 
non-promising nodes, tend to happen to nodes not in the root list. Thus, we expect little 
contention on the locks the consolidate process acquires while merging trees. Overall, 
each consolidation process runs in time O(lg |section|) time. 


3.4.5 Controlling the Quality of Extracted Nodes 


There are several parameters that control the promising quality of extracted nodes: 
mazpt, buffersize, and strictness. Mazpt is the size of the promising list, buffersize is 
the size of the buffer used during the consolidation process to gather candidates for the 
promising list, and strictness is used in the heuristic function good. We can see that a 
smaller value of marpt means that the nodes in the promising list are more promising. 
The extreme case is that mazpt equals 1 — there is only one pointer as in the sequential 
Fibonacci heap. On the other hand, a smaller mazpt implies more contention on the 
promising list. A good value of mazpt might be the number of “workers” on the parallel 
Fibonacci heap. 

In the consolidation process, the top buffersize number of non-promising nodes in the 
root list of a section are gathered in a buffer, and are checked if they are promising. 
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The smaller buffersize is, the better the nodes the buffer contains, the fewer candidate 
nodes there will be for the promising list, and the longer the time it takes to extract a 
promising node. On the other hand, larger buffersize incurs more traffic on the promising 
list, because there will be more check-promising processes trying to put nodes into the 
promising list. 

The effect of the parameter strictness is explained in Section 3.4.1. Experiments that 
vary these parameters are presented in Chapter 5. 


3.4.6 Decrease Key Operation 


Figure 3.9 shows the pseudocode for decreasing the key of node z to k. Like the sequential 
decrease key operation discussed in Chapter 2, the idea of the concurrent decrease key 
operation is to check if k is smaller than z’s old key, and then change z’s key to k. After 
the key change, if the heap order property is violated, then cut z from its parent; if 
an internal node loses more than one child then perform cascading cuts. Cut(h, 2) will 
change z’s parent link and its parent y’s child link. Both z and y have to be locked during 
the operation. The order of locking is important here; the wrong locking order can cause 
deadlock. Consider the case of locking in bottom-up order where y is a promising node 
in the root list, ¢ is one of y’s children, and there is a decrease key operation that is 
trying to cut z from y. Suppose the decrease key operation has already locked z, and is 
trying to lock y. In the mean time, another process is doing an extract operation on y, 
having locked y, and is trying to put y’s children, including z, into the root list. In the 
process of putting y’s children into the root list, z’s parent field will be updated. If we 
require locking z before updating its parent field, then this results in a deadlock. If we 
update z’s parent link without locking it, it would be dangerous for the decrease process 
to read it. 

Figure 3.9 shows a way to lock in a top-down order that avoids the problem described 
above. This locking order also makes the extract operation easier. When we put y’s 
children into the root list in the extract operation as described above, we only need to 
lock y, because in the top down locking order, y’s children won’t be updated unless y has 
been locked. The decrease key operation works in two phases: Phase 1 locks z, locates 
its parent y if there is one, and unlocks z. Phase 2 locks y then 2, verifies y is still z’s 
parent, and does things as in the sequential case. If y is no longer z’s parent in phase 
2, we go back to phase 1 to locate z’s parent again. In phase 1, lines 5-10 lock z, check 
whether zx has a parent. If not, line 13 sets z’s key; otherwise, line 17 sets the variable 
has-parent? to be true for use in phase 2. Phase 2 checks if variable has-parent? is true, 
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then locks y and x. After y and z have been locked, we verify if y is still z’s parent, then 
change z’s key and do the cut in lines 29-36 as in the sequential decrease key operation. 
Finally, cascading-cuts are done if needed in lines 43-44. If it turns out that y is no longer 
z’s parent in phase 2, then we go back to phase 1 to find z’s current parent, and repeat 
the whole process until z’s true parent is found. 

If there is no other operation updating z or y between phases 1 and 2, which is 
likely to be the common case, the parental relationship between y and z does not change 
between phases 1 and 2. Thus, in most cases, the decrease key operation succeeds without 
repeating phase 1 and 2. Also, the contention on z and y should be relatively small, since 
it should be rare that different workers are doing operations on the same x and y. The 


time taken to do the decrease operation is O(1). 


3.4.7 Delete Operation 


The delete operation, as shown in Figure 3.10, is similar to the decrease key operation. 
Instead of cutting z and putting it in the root list as in the decrease key operation, we 
put z’s children into the root list in lines 12 and 28, and mark z to be dead in line 13 if 
zx is in the root list; or remove z in line 29 in case it is an interior node. 


3.4.8 Algorithm Validation 


We informally show that the algorithms for the parallel Fibonacci heap are deadlock-free 
as follows. Horizontally, the root list of the parallel Fibonacci heap is a DLL with dummy 
nodes, and we have shown that the operations on a DLL are deadlock-free in Section 3.2. 
Vertically, the parallel Fibonacci heap is a forest of trees, and we always lock nodes in a 
top-down order in the algorithms. 

We also validated the correctness of operations experimentally: we occasionally ran 
a verify-form procedure to check the syntactic correctness of the heap (i.e., whether the 
number of nodes in the heap, the number of nodes in the root list, and the number of 
promising nodes are correct) and the semantic correctness of the heap (i.e., that the 
parallel Fibonacci heap is in correct heap-order, and satisfies the heap constraints). 


3.5 Summary 


The parallel Fibonacci heap presented in this chapter is based on the sequential Fibonacci 
heap described in Chapter 2. The parallel Fibonacci heap maintains the asymptotic time 


40 


bounds of its sequential counterpart, and it also achieves linearly scalable performance. 
The parallel Fibonacci heap has the following properties: 


1. The locks each operation acquires are evenly distributed over the entire data struc- 
ture and the time each operation takes while holding a lock is small. Assuming 
the size of the structure is relatively large compared with the number of processes 
accessing it, then there is very little contention on the structure and we expect lin- 
early scalable throughput. This scalability is reflected in the performance analyses 
in Chapter 5. 


2. Ignoring contention, the sequential operations’ time bounds have been preserved: 
an insert operation takes only constant time, an eztract operation takes O(lg n) 
time, a decrease key operation takes constant amortized time, and a delete operation 
takes O(lg n) time. 


3. The priority queue is non-strict in the sense that an extract operation does not 
necessarily return the most promising node, but the promising quality can be con- 
trolled as described in Section 3.4.5. These non-strict semantics are compatible 
with most parallel applications, if not all, and they are also one of the reasons that 
the parallel Fibonacci heap has relatively low contention. 
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proc extract(id, h) 
% extract a promising node from parallel Fibonacci heap h 
% id is a preassigned worker id 


Randomly choose a pointer prom—pt from the promising list 
(label#try) 
if we have tried “enough” times, but still fail to find a promising node then 
% "enough” can be tuned here 
consolidate(id, h) 
end 
Lock prom—pt % if prom—pt is locked, we can try another 
if prom—pt = nil then 
Unlock prom—pt 
prom—pt := another pointer in the promising list 
goto (label#try) 
else 
prom—one := *prom—pt 
Lock key[prom—one] 
if mark[prom—one] = promising then 
if prom—one has any children then 
put its children into the root list 
end 
mark[prom—one] := dead 
Unlock key[prom—one] 
Unlock prom—pt 
return prom—one 


Unlock prom—one 
prom—pt := another pointer in the promising list 
goto (label#try) 

end 


end extract 


Figure 3.7: Extract operation on parallel Fibonacci heap 
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proc consolidate(id, h) 
% Consolidate a section (or multiple sections) of the parallel Fibonacci heap h 
% and find candidates for the promising list 


Randomly find a section not being consolidated by other processes and lock it 
for every node x in the section do 
case mark[x]: 
unmarked, marked: 
Merge trees like in the sequential consolidation. Don’t merge dead 
or promising nodes. 
Maintain a buffer B of top buf fersize number of 
candidate nodes(non—promising nodes) for the promising list. 
% buffersize here ts tunable parameter 


dead: 
if x’s left neighbor is not a dummy node then 
Lock key[x] 
Remove x from root list 
Unlock key[x] 
end 
promising: 
dummy: 


end 
Unlock section 
for every node n in buffer B do 
check—promising(h, n) 
end consolidate 


Figure 3.8: Consolidate process on parallel Fibonacci heap 
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proc decrease—key(id, h, x, k) 
% Decrease the key value of z to k in parallel Fibonacci heap h 
done? := false % done? means whether the decrease operation has been accomplished or not 
has—parent? := false % has—parent? indicates whether node z has parent or not 
cascading—cut? := false % cascading—cut? indicates whether cascading—cut is needed 
Repeat AXEL NLL %%%%% Phase 1 
Lock key[x] 
if mark[x] = dead then 
Unlock key[x] 
return 
else 


Yo= 
if y = nil then % z doesnt have parent, it is in the root list 
if (k < key[x]) then 
key[x] := k 
end 
done? := true 
else 
has—parent? := true 
end 
end 
Unlock key(x] 
RNAKLLKENK%%%%%%%%%%% Phase 2 
if has—parent? then 
Lock key[y] % y was z’s parent, but may not be now, which happens rarely 
Lock key(x] 
if (parent([x] = y) then 
if mark[x] = dead then 
Unlock key[x] 
return 
else 
if (k < key[x]) then 
key[x] := k 
end 
done? := true 
if (key[x] < key[y]) then % heap order has been violated 
cut(h, x) 
cascading—cut? := true 
end 
end 
end 
Unlock key([x} 
Unlock key([y] 
end 
Until done? 
If cascading—cut? then 
cascading—cut(id, h, y) 
end 
end decrease—key 


Figure 3.9: Decrease key operation on parallel Fibonacci heap 
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proc delete(id, h, x) 
% Delete the node z from parallel Fibonacci heap h 


done? := false % done? means whether the delete operation has been accomplished or not 
has—parent? := false % has—parent? indicates whether node zt has parent or not 
cascading—cut? := false % cascading—cut? indicates whether cascading—cut is needed 
Repeat K4XLX%LXELNL%G%S%%%%% Phase 1 
Lock key[x] 
if mark[x] = dead then 
Unlock key[x] 
return 
else 
y := parent[x] 
if y = nil then % + doesn't have parent 
Put x’s children into root list if there are any 
mark[x] := dead 
done? := true 
else 
has—parent? := true 
end 
end 
Unlock key/x] 
LALKELLNLLLS%G%%%%% Phase 2 
if has—parent? then 
Lock key[y] %y was x’s parent, but may not be now, which happens rarely 
Lock key|[x] 
if (parent(x] = y) then 
if mark[x] = dead then 
Unlock key(x] 


return 
else 
Put x’s children into root list if there are any 
Remove x from y’s children list 
cascading—cut? := true 
done? := true 
end 
end 
Unlock key[x] 
Unlock key(y] 


end 
Until done? 
If cascading—cut? then 
cascading—cut(id, h, y) 
end 
end delete 


Figure 3.10: Delete operation on parallel Fibonacci heap 
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Chapter 4 
Concurrent Priority Pool 


In this chapter we present another kind of concurrent priority queue, which is imple- 
mented as a combination of a concurrent B-tree and a concurrent pool. We call this 
priority queue a “concurrent priority pool”. The concurrent priority pool supports in- 
sert and extract operations like the parallel Fibonacci heap. The extract operation is 
non-strict, as described in section 3.1, but there is a straightforward way of controlling 
the promising quality of extracted keys. The insert and extract operations do not share 
critical resources in most cases, so that the concurrent priority pool has the highest 
throughput among all the priority queues studied, as shown by the experimental results 
in Chapter 5. Section 4.1 briefly describes the concurrent B-tree. Section 4.2 gives an 
introduction to the concurrent pool. The concurrent priority pool and access algorithms 
are presented in section 4.3. Finally, Section 4.4 summarizes this chapter. 


4.1 Concurrent B-Trees 


The Concurrent B-Tree described here is mainly based on [Wan90, WW90, LS86, LY81]. 
This algorithm allows symmetric insertion and deletion in which each process locks at 


most one node at a time, except in rare cases. 


4.1.1 Data Structure 


The concurrent B-tree data structure is similar to the sequential B-tree described in 
Chapter 2. Figure 4.1 shows an example of a concurrent B-tree: A B-link structure is 
added into the sequential B-tree by connecting nodes on each level into a singly linked 
list. Each node has a right link that points to its right neighbor. Operations can go 
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level 3 


level 2 


level 1 


level 0 


leaf rightbound B-link 


Figure 4.1: An example of a concurrent B-tree 


across the linked list horizontally instead of vertically. An anchor, an array of pointers to 
the leftmost node on each level of the B-tree, is added into the sequential B-tree. With 
the anchor and the B-link structure, a node can be reached not only from its parent, but 
also from its left neighbors or the anchor. 


4.1.2 Insert Operation 


Inserting a new key k into a concurrent B-tree invokes two phases: the locate phase 
and the insert phase. The locate phase, which is similar to its sequential counterpart, 
traverses the B-tree from the root to the leaf level by following pointers P; in the internal 
nodes that have two neighbors K; and Kj4; satisfying K; < k < Ki4:. In the locate 
phase, only one internal node is locked at a time. In fact, the nodes only need to be 
read locked, since the nodes are not changed. After a leaf node n is located, we insert 
key k& into n. If n is full, we split n as shown in Figure 4.2. The split operation is 
done in two steps: a half-split as shown in Figure 4.2(b), followed by a complete-split 
as shown in Figure 4.2(c). Half-split creates a new node n’, inserts n’ to the right of 
n, and moves some data from n to n’. Complete-split goes up the tree, inserting a new 
< left bound, pointer > into n’s parent m. If m is full, then we split m in the same 
way as we split n. This split process can propagate from the leaf level up to the tree 
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Figure 4.2: Split a concurrent B-tree node (a) Before inserting key 10 into n (b) Half 
split n (c) Complete split n 


root, which might result in creating a new root, and increasing the B-tree height. In all 


situations, we write lock a node before updating it. 


4.1.3 Delete Operation 


The delete operation on a concurrent B-tree is symmetric to the insert operation. It 
consists of two phases: the locate phase and the delete phase. The locate phase is the 
same as that in the insert operation; it locates the node n containing the key k to be 
deleted. The delete phase removes k from n; if n is then empty, it merges n’s right 
neighbor n’ into n. The merge is also done in two steps: a half-merge as shown in 
Figure 4.3(b), and a complete-merge as shown in F igure 4.3(c). Half-merge first write 
locks n and n’ and removes n’ from its level’s linked list. It then moves data from n’ to 
n and sets the right link of n’ to n before unlocking n and n’. Processes that try to find 
data in n’ still can find them through its right pointer that forwards to n. Complete- 
merge removes a <left bound, pointer> pair from n’s parent m. If m is then empty, we 
merge m with m’s right neighbor. This merge process can propagate up to the tree root, 
which will possibly decrease the height of the tree. There is a special case when complete 
merging n and n’: if n and n’ do not have the same parent; this case is explained in 


[Wan90]. 


48 


Figure 4.3: Merge two concurrent B-tree nodes (a) Before taking key 10 out of n (b) Half 
merge n (c) Complete merge n 


4.2 Concurrent Pools 


Concurrent pools[Man86][KE89] are largely used in the assignment of resources and tasks 
to processors in a distributed or parallel system that needs to balance the load on each 
processor. A pool is a collection of items that grows and shrinks with the demands 
of the processes. A process may add an element to the pool or request an element 
from the pool at any time; the element removed from the pool is chosen arbitrarily. 
A concurrent pool attempts to spread the elements out over the processors so that 
accesses are less likely to interfere with each other. The basic idea of the concurrent pool 
is to allow most operations to be done within the local components of the distributed 
data structure. When a request cannot be satisfied locally, it becomes necessary to access 


remotely stored components. 


4.3 Concurrent Priority Pools 


The concurrent priority pool is based on the concurrent B-tree and the concurrent pool. 
It is similar to the concurrent B-tree, except that the leaves of the B-tree are replaced 
with concurrent pool-like data structures. An insertion into the priority pool is like the 
insertion into the B-tree, which takes O(lg n) time. The extract minimum operation on 
the priority pool is similar to the delete operation on the B-tree, but we always delete 
elements from the promising pools — the leftmost leaf in the B-tree. 
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Figure 4.4: Data structure of concurrent priority pools 


4.3.1 Data Structure 


The concurrent B-tree is the basis for the concurrent priority pool. Each leaf of the 
priority pool is similar to a concurrent pool — A leaf contains segnum number of data 
segments. Each segment consists of segsize number of keys and associated data. The 
segment is the smallest unit that is locked during the insert and extract operations. 
Even when splits and merges happen, leaves are only locked briefly, as we will see in the 
next few sections. There can be different operations running concurrently on different 
segments in the same leaf. 

As shown in Figure 4.4(a), a segment has an array of keys and associated data, a 
status indicator, a local separator, a lock and a local right link. The segment local 
separator is usually equal to the right bound of the leaf the segment is in, except in the 
middle of splitting or merging. The segment right link points to the leaf that contains 
keys equal to or larger than the segment separator; that is usually the right neighbor of 
the leaf containing the segment. The status indicator indicates whether the segment is in 
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normal mode or has been deleted. The segment can only be changed when the segment 
lock is acquired. 

The keys in a segment are stored in an array that is ordered from largest to smallest. 
This simplifies extracting the smallest key: we only need to return the rightmost element 
of the array and decrease the array size by one. Keeping segments sorted also makes it 
easier to find a medium key in a segment, which is used in splitting the segment. On the 
other hand, it is more expensive to insert a key in a sorted segment and to merge two 
sorted segments. 

A leaf has three major parts, as shown in Figure 4.4(b) : synchronization data, 
sequential data, and segnum number of segments. Sequential data consists of segnum, 
segsize, right bound, mark, right link, and separator. The right bound of the leaf is 
usually the largest key in the leaf. This is not true in two cases: when the leaf is being 
split, in which case there may be some larger keys that have not been moved to the right 
neighbor yet; or the when the leaf is being merged, in which case the right bound may be 
larger than all the keys in the leaf. The leaf mark is one of dead, orphan, dead-orphan, 
or nil: dead means the leaf has been deleted; orphan means that there is another leaf 
with the same right bound as this leaf, and the orphan leaves do not have parents as 
described in Section 4.3.2; dead-orphan means the leaf is both dead and an orphan. 

Synchronization data consists of a leaf lock, a status indicator, and a merging-leaf field 
that points to the leaf, if any, that has been merged with this one. The status indicator 
is one of normal, split, merging, split-merging, and deleted: normal means the leaf is in 
normal mode, split means the leaf is being split, merging indicates that the leaf is now 
merging with another leaf, deleted indicates the leaf has been deleted, and split-merging 
means there is a split and a merge concurrently going on in the leaf. Figure 4.5 depicts 
the possible status transitions of a leaf. The leaf sequential data and synchronization 
data can be changed only when the leaf-lock is acquired. 


4.3.2 Duplicate Keys 


The concurrent B-tree, the basis for the concurrent priority pool, is changed to allow 
duplicate keys. On the leaf level of the B-tree, we allow multiple leaves with the same 
right bound; only one of the leaves can be directly reachable from internal nodes, and the 
rest of them are marked as orphans. Thus, there are no duplicate separators in internal 
nodes. The original concurrent B-tree algorithms are changed slightly: 
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Figure 4.5: Concurrent priority pool leaf status transition graph 


1. While doing “complete-split” as shown in Figure 4.2(b), which tries to adda 
< separator, pointer > pair into internal node m, if we find there already exists a 
separator in m, then instead of adding the pair in, we mark the leaf pointed to by 


pointer as an orphan. 


2. While doing “complete-merge” as shown in Figure 4.3(b), which tries to delete a 
< separator, pointer > pair from an internal node, if we find the leaf pointed to by 
pointer is marked as an orphan, then we know the pair is not in an internal node. 


Thus, we can quit from complete-merge. 


This method treats all leaves, whether orphan or not, quite uniformly while doing 
insert and extract operations. It also keeps the structure of internal nodes the same, so 
that the original concurrent B-tree algorithms on internal nodes are still applicable. 


4.3.3. Insert Operation 


Inserting a key into a priority pool invokes two steps: first, locating a leaf as in the 
concurrent B-tree algorithms; second, as described in this section, inserting the new key 
into the leaf, and performing split operations if necessary. Here we only present the 
algorithms on the leaves of the priority pool, since the algorithms on the internal nodes 
are the same as those for a concurrent B-tree. Figure 4.6 shows the pseudocode for 
inserting a key in leaf | of tree. We first randomly locate a segment s in leaf /, and lock 
it in lines 1-2. Line 3 checks whether segment s is the right one to insert key in — if key 
is larger than separator|s], then we insert key into the leaf that is pointed to by right{s]. 
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proc insert(l, key, tree) 
% insert a new key into leaf | of tree 


1 (label#0) Randomly locate a segment s in leaf | 

2 (label#1) Lock s 

3 if (key > separator{(s]) 

4 1 := right[s] 

5 Unlock s 

6 goto (label#0) 

7 else 

8 case indicator{s] 

9 normal: 

10 if s is not full then 

11 insert key into segment s 

12 Unlock s 

13 else 

14 if we have not tried twice(or some other number) then 

15 Unlock s 

16 8 := another segment in leaf | 

17 goto (label#1) 

18 else 

19 Lock | 

20 case indicator[]]: 

21 normal: 

22 Unlock s 

23 originate—split(l, key) 

24 split, split~merging: 

25 Unlock | 

26 split(s, 1, 1’, separator{l], key) 
% V is I’s right neighbor; assume I’ and separator{l] 
% are read before | is unlocked 

27 Unlock s 

28 8 := another segment, goto (label#1) 

29 merging: 

30 Unlock s 

31 originate—split(1, key) 

32 deleted: 

33 Unlock s 

34 Unlock | 

35 insert(l’, key, tree) % lI’ is pointed by the right link of 1 

36 end 

37 end 

38 end 

39 deleted: 

40 insert(right[s], key, tree) 

41 end 

42 end 


43 end insert 


Figure 4.6: Insert operation on concurrent priority pool 
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We check indicators] in line 8: if s is deleted, then we insert key into the leaf pointed 
to by rzght([s] in lines 39-40. If s is in normal mode, we do “normal insertion” in lines 
10-38. Line 10 checks whether s is full; if not, we directly insert key into s. Otherwise, 
we try to find other segments in leaf ! to do the insertion in lines 15-17. If we still can 
not find a non-full segment in leaf ! to insert key after some number of tries, we try to 
split leaf / in lines 19-36. Leaf / is locked in line 19 to check the indicator of /. In case 
indicator[i] is normal, the originate-split procedure is called to originate splitting leaf /. 
In case indicator[l] is split or split-merging, which means leaf / is already being split, we 
unlock / and help split segment s in lines 25-26. We try to insert again in line 28. In case 
leaf / is merging with another leaf, / is split by calling originate-split in line 31. In case / 
has been deleted, though s has not been deleted yet, we insert key in the leaf pointed to 
by right{l] in line 35. 

Figure 4.7 shows the pseudocode of splitting a leaf of a concurrent priority pool. 
Procedure originate-split splits leaf / and inserts k into the priority pool. Procedure split 
splits a segment. 

At the entry of originate-split, we assume | has been locked. Line 1 checks indicator|I] 
and changes it as depicted in Figure 4.5: if it is normal, then it is changed to split; if it 
is merging, it is changed to split-merging. Line 6 creates a new empty leaf !’ with right 
bound, right link, segnum, segsize set to the same as those in leaf /. Line 7 chooses a 
separator for leaf I, and puts it in separator|]. Line 8 unlocks /; note that the leaf lock 
is held for a relatively short time (lines 1-8). Lines 9-12 split all segments in /. While the 
originate-split process is splitting segments in I, there can be other processes helping split 
segments in / — see line 27 of the insert procedure in Figure 4.6. After all the segments 
are split, | is locked to change indicator[I] back as shown in Figure 4.5. Once again, the 
leaf is locked for only a brief time. In line 20, key k is inserted into / or I’ depending on 
the chosen separator: if k is larger than sep, we insert k into l’ and vice versa. Line 21 
does complete-split by trying to add a new < separator(i], i’ > pair in l’s parent. 

Procedure split in Figure 4.7 splits segment s if it hasn’t been split yet — separator|s] > 
separator({], or it has been split — separator|[s] = separator[[] and s is still full. In either 
case, we move some data from s to its right neighbor I’. 

The time taken to insert a key into a concurrent priority pool is composed of the 
time taken to go from the tree root down to the leaf level, the time taken to insert the 
key into a leaf, and the time to do complete-split. We have seen that the leaf does not 
need to be locked if it is not split, and is only locked very briefly to change the indicator 
and link fields if a split happens. Thus, there is very little contention on inserting a 
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proc originate—split(1, k) 
% Originate splitting leaf | k is a key to be inserted. 


if (indicator[l] = normal) 
indicator[l] := split 


else 
%% indicator is merging 
indicator(l] := split—merging 
end 
Create a new empty leaf |’ and link it to the right of | 
separator(l] := choose—separator(s, 1) 
unlock | 
forall segment s in | do 
lock s 
split(s, I, 1’, separater[l], k) 
unlock s 
lock | 
if (indicator(]] = split) 
indicator[l] := normal 
else 
%% indicator is split—merging here 
indicator[]] := merging 
end 
unlock | 


insert k depending on sep 
Do complete—split as in the concurrent B—tree 
end originate—split. 


proc split(s, 1, 1’, sep, k) 
%% Assume s has been locked 
%% Split segment s in leaf | depending on separator sep. 


right(s} := &I 

if ((separator[s] > sep) or 
((separator[s] = sep) and 
full(s))) then 


separator(s] := sep 
move some data from | to |’ using 
sep as a filter ——— like the insert operation as futures 
end 
end split 


Figure 4.7: Split a leaf of concurrent priority pool 
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key into a leaf if there are enough segments in a leaf. The overall time taken to do the 
insert operation should be comparable to the time taken to do insertion in the concurrent 
B-tree, O(/g N), where N is the number of keys in the priority pool. 


4.3.4 Extract Operation 


The leftmost leaf of a concurrent priority pool contains keys smaller than keys in other 
leaves. The extract operation on a concurrent priority pool always extracts a key from the 
leftmost leaf. Since the anchor contains direct pointers to the leftmost node on each level, 
we can locate the leftmost leaf without going down from the tree root. This decreases 
the traffic through the root. 

The number of keys a leaf contains can be controlled, hence, the promising quality 
of extracted keys can be controlled — we can vary segnum and segsize to control the 
number of promising elements in the leftmost leaf. The extract operation always finds a 
key that is one of the segnum * segsize smallest keys in the concurrent priority pool. In 
practise, the extracted key is usually better than the given bound, because the smallest 
key in a segment is extracted first. 

Figure 4.8 shows the pseudocode for the extract operation. First, we randomly pick 
up a segment s in leaf / and lock it. We check whether s is in normal mode in line 
3. If not, we go to the leaf pointed to by s’s right link to do the extract operation in 
lines 38-40. Otherwise, we do “normal deletion” as following. If s is not empty then we 
extract the smallest key from s in line 5. If s is empty, then we can try other segments 
in lines 9-11. If we fail to find a non-empty segment in / after several tries, we merge / 
with its right neighbor in lines 13-34. We lock | to check indicator|[] in line 14. In case it 
is normal, the originate-merge procedure is called to start merging. In case indicator] 
is merging, we help merge some segments in leaf ! by calling the help-merging procedure 
at line 27. In case leaf / has been deleted, we go to l’s right neighbor to do the extract 
in lines 29-33. If leaf | is being split, we simply go back to try other segments, because 
we have not found a non-empty segment yet, so we can not help the split; if we find a 
non-empty segment, then the extract operation will be done. 

Figure 4.9 shows the pseudocode of the procedure originate-merging, which merges 
two leaves in the concurrent priority pool. We assume leaf ! is locked upon entrance. Line 
1 finds /’s right neighbor I’ and locks it. Line 2 tests the indicator of I’. If it is normal, 
we merge | and I’ in lines 4-21, do complete-merge as in the concurrent B-tree, and redo 
the extract operation in lines 22-23. The locks of leaves / and I’ are acquired only to 
change their indicator and right fields in lines 4-8. Lines 11-14 merge all segments in !’ 
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proc extract(1) 
%% extract a key from leaf 1 in the concurrent prionty pool 


(label#0) randomly pick up a segment s in | 
(label#1) lock s 
if indicator[s]) = normal then 
if s is not empty then 
extract the smallest key from s 
unlock s 
else 
if we have not tried to delete enough times then 
unlock s 
8 := another segment in | 
goto (label#1) 


%% do merge here 
lock | 
case indicator{]]: 
normal: 
%% normal merge 
unlock s 
if 1 is not the rightmost leaf then 
originate—merge(1) 
end 
split: 
unlock | 
unlock s 
s := another segment; goto (label#1) 
merging: 
unlock s 
unlock | 
help—merging(1, merging—leaffl], right[]]) 


% The merging—leaf and right fields of | should be 


% read before unlocking | 
goto (label#0) 
deleted: 
unlock 3s 
unlock | 


1 := right] % right/l] should be read before unlocking | 


goto (label#0) 
end 
end 
end 


%% indicator[sJ= deleted 
1 := right[s] 
unlock s 
goto (label#0) 
end 
end extract 


Figure 4.8: Extract operation on concurrent priority pool 
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proc originate—merging(1) 
%% try to merge l’s right neighbor I’ with | 
%% assume | is locked at entry 


(label#1) lock  % U is the right neighbor of | 
case indicator[]’] 
normal: 
indicator{l] := merging 
indicator[l’] := deleted 
merging—leaffl] := &P 
right[l] := right(1] 
right(l] := &l 
unlock I 
unlock | 
forall segments s’ in |’ do 
lock s’ 
match—merge(s’, 1, P, 1”) 
%% I” is the right neighbor of | and should be read before unlocking | 


unlock s’ 

lock | 

if (indicator[l] = merging) 
indicator[l] := normal 

else 
%% indicator is split—merging 
indicator[l] := split 

end 

unlock | 

Do complete merge like in the concurrent B—tree 

extract(l) 

split, split—merging: 

unlock |’ 

unlock | 

extract (I) 


%% This is a rarely happening loop. We cannot help split here, 
%% since I’, the destination leaf, is unlocked and may be merged again. 
merging: 
unlock |’ 
unlock | 
help—merging(l’, merging—leaf[]’], right[]’]) 
%% assume merge—leaffl’] and right[l’] are read before unlocking I’ 
extract(l) 
deleted: error 
end 
end originate—merge 


Figure 4.9: Merge two leaves of concurrent priority pool 
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with segments in leaf /. The match-merge procedure, which is described later, is called 
to ensure that every segment in / will be updated. While the originate-merging process 
is merging the segments in /’ into !, other processes can help to do the merge as shown 
in line 31. Lines 15-21 lock / to change its indicator back to normal or split. Once again, 
the leaf lock is held briefly. If indicator(l’] is split or split-merging, we just go back to 
extract again in line 27. If indicator[l’] is merging, we help merge some segments in 
lines 28-32. The indicator of I’ cannot be deleted, because deleted leaves are moved out 
of the linked list — they cannot be I’s right neighbor. 

Procedure match-merge, as shown in Figure 4.10, merges segment s’ of leaf I’ with 
the corresponding segment s of leaf |. Because there are the same number of segments 
in every leaf, it is not hard to create a one-to-one correspondence between segments in 
two leaves. Leaf /” was the right neighbor of 1, but may be not now. Consider the 
example shown in Figure 4.11, in which leaf / is changed to the split-merging state from 
the merging state, and a new leaf /new is created between / and /”. Segment s1 in / has 
been split, so s1’s right link points to Inew. Segment s0 in / has not been either split or 
merged yet, so its right link points to I’. Segment 32 in / has been merged but has not 
been split yet — its right link points to /”. The right links of segments in leaf / are set 
to point to l” if the segments have not been split or merged; otherwise, the right links 
are left unchanged. The split process, concurrently going on with the match-merge, will 
change the right links of all segments in / to point to Inew as shown in line 1 of the split 
procedure in Figure 4.7. Thus, the match-merge procedure will change s0’s right link 
to point to I” because it has not been either split or merged; 31’s right link will not be 
changed since it has been split; segment s2’s right link will be changed to point to Inew 
by the concurrent split process. 

Figure 4.10 also shows the pseudocode for the help-merging procedure. This help- 
merging procedure randomly picks up a segment s’ from leaf I’, locks it, and calls match- 
merge to merge the segment if s’ is in normal mode and non-empty, then unlocks it. 
Actually, we could help to merge more segments in the help-merging procedure. 

Assume there are enough number of segments in a leaf, so that there is not much 
contention on grabbing a segment from the leaf. If the segment is not empty, then the 
extract operation takes only constant time — it can just take the smallest key in the 
segment. If the segment is empty and we cannot find a non-empty one after several tries, 
we need to merge the leftmost leaf with its neighbor, which takes O(segnum * segsize) 
time. If we count in the time taken to do complete-merge, O(/g NV), the extract operation 


takes time O(lg N). 
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4.4 Summary 


This chapter presents another new concurrent priority queue called the concurrent pri- 
ority pool, which is based on concurrent B-trees and concurrent pools. The concurrent 
priority pool supports insert and extract operations like the parallel Fibonacci heap. The 
structure of the concurrent priority pool is very similar to the concurrent B-tree, except 
the leaves are replaced with concurrent pool-like data structures. Each leaf of a concur- 
rent priority pool consists of several segments, each of which contains a fixed number of 
keys. There can be different operations going on different segments in the same leaf. The 
lock granularity of normal insert and extract operations is pushed down to the level of seg- 
ments instead of leaves. Even when splits and merges happen, the leaves are locked only 
briefly. The insert and extract operations do not share critical resources in most cases, 
which is one of the reasons why the concurrent priority pool has the largest throughput 
among all the priority queues studied, as shown by the experimental results in Chapter 
5. Also, the concurrent priority pool provides a straightforward way of controlling the 


promising quality of extracted keys. 
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proc match—merge(s’, 1, I’, 1”) 
%% Match—-merge moves data from segment s’ of leaf I’ into the 
%% corresponding segment in leaf l. 


1 Lock s %% s is the segment in | corresponding to s’ in 
2 if (right[s}] = 1’) then 
%% s hasn’t been either merged or splitted 


3 right{s] := &l” 

4 separators] := right—bound[]’] 

5 end 

6 ‘Transfer data from segment s’ in |’ to s. 


%% In this way, we are sure that every segment in | is touched. 
7 If it does not all fit, insert the rest normally by calling insert procedure as futures. 
8 Unlock s 
9 indicator[s’] := deleted 
0 end match—merge 


proc help—merging(l, |’, I”) 


Choose a segment s’ in I’ 

Lock s’ 

if ((indicator{s’]) = normal) 
and (not empty(s’))) then 
match—merge(s’, |, 1’, 1”) 

end 

Unlock s’ 

end help—merging 


GWNIN OP OD 


Figure 4.10: Match merge corresponding segments in two leaves on concurrent priority 
pool 
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Figure 4.11: Match-merging two leaves | and |’ 
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Chapter 5 
Experimental Evaluation 


In this chapter we present the experimental evaluation of the parallel Fibonacci heap and 
the concurrent priority pool and compare them with the concurrent binary heap. Section 
5.1 describes the experimental environment and model. Section 5.2 shows the effects of 
different parameters on the parallel Fibonacci heap. Section 5.3 shows the the effects 
of different parameters on the concurrent priority pool. Section 5.4 compares different 
concurrent priority queues in terms of throughput. Section 5.5 presents two applications 
of concurrent priority queues: the single source shortest path problem(SSSP) and the 
vertex cover problem(VCP). Finally, Section 5.6 summarizes this chapter. 


5.1 Experimental Environment 


Experiments have been performed on Encore Multimaxes. The language used is Mul-T 
[KHM89], a Lisp-like programming language with futures and lock mechanisms. Two En- 
core machines have been used in the experiments: one with ten processors at LCS/MIT, 
where most of the debugging tests were done; one with twenty processors at the Argonne 
National Lab ?. 

In most of the experiments, the master-worker model is used: a master spawns a fixed 
number of workers, each of which performs access-think cycles. An access can be an insert, 
extract, decrease key or delete on a concurrent priority queue. Think time is modeled 
by a simple delay in a loop; the number of iterations denotes the think time. Think = 
0 means the workers do not think at all, and think = 1000 means think consists of 1000 


1Only 18 processors can be used for running Mul-T. Due to some unknown errors, running Mul-T 
with large number of processors has caused the Encore at Argonne Lab to crash. Thus, we did not get 
all possible data up to 18 processors. 
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loop iterations. Since the decrease key and delete operations are not supported well on 
binary-heap-based concurrent priority queues and concurrent priority pools, we compare 
them by measuring only the insert and extract operations. In most trials described in 
this chapter, the following worker model is used unless otherwise stated: the number 
of workers is equal to the number of processors available; each worker performs access- 
think cycles on a heap initially containing 1000 keys, and access to the priority queue 
is composed of 55% inserts and 45% extracts. The keys inserted are randomly chosen 
from the range 0 to 10000. All workers are started at approximately the same time, 
and the first worker that finishes 1000 access-think cycles will stop other workers. The 
throughput is the total number of cycles performed by all the workers divided by the 
elapsed time. I used the timer facilities of Mul-T version 25 to collect data. 


5.2 Parallel Fibonacci Heap 


The parallel Fibonacci heap has three parameters: mazpt, buffersize, and strictness as 
described in section 3.4.5. We have tested different combinations of buffersize and strict- 
ness, with mazpt set to be the same as the number of processors. Figure 5.1 shows the 
throughput (cycles/second) vs. the number of processors, while the think time is 0. We 
can see that the throughput in the trials is linearly increasing with the number of proces- 
sors, from around 70 with 2 processors to around 680 with 18 processors. We can roughly 
see from Figure 5.1 that all the curves are very close to each other, which indicates that 
the parameters buffersize and strictness do not affect the throughput too much. Trials 
with larger buffersize and strictness have a little larger throughput. However, strictness 
has more impact than buffersize. Note in Figure 5.1 that the throughput is quite good 
when buffersize = 1, and strictness = 1. Buffersize = 1 means only the least key in a 
parallel Fibonacci section is selected as a candidate for the promising list in the process of 
consolidation, and strictness = 1 means the promising list will only get better candidates 
from direct promise-checking since the good heuristic function filters out almost all keys 
worse than keys in the promising list. 

Figure 5.2 shows the throughput vs. the number of processors when think = 1000, 
and different buffersize and strictness. The curves are quite similar to the case of think 
= 0, except the throughput is less due to the think time. Figure 5.2 also shows the trials 
with strictness equal to 1. It shows the throughput of the parallel Fibonacci heap does 


?This avoids extracting from an empty priority queue. 
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Figure 5.1: Parallel Fibonacci heap: Throughput (cycles/second) vs. number of proces- 
sors while think =0, different values of parameters buffersize and strictness 


not change too much with different buffer when strictness = 1. 


5.3 Concurrent Priority Pool 


The concurrent priority pool has two parameters: segnum, which is the number of seg- 
ments in a leaf, and segsize, which is the number of keys contained in each segment and 
the number of < pointer, bound > pairs in an interior node. We have done some exper- 
iments on different values of segnum and segsize. In the experiments, ordinary blocking 
locks are used instead of read-write locks (see Section 4.1). Using read write locks should 
reduce the contention on interior nodes of the B-tree. Figure 5.3 shows the throughput 
vs. the number of processors when think = 0, segsize = 3, and different segnum. Fig- 
ure 5.4 shows the throughput vs. the number of processors when think = 0, segsize = 
5, and various segnum. Figure 5.5 shows the curves when think = 0, segsize = 7, and 
different segnum. These three graphs have one thing in common: the throughput are 
linearly increasing with the number of processors, and all the curves are close to each 
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Figure 5.2: Parallel Fibonacci heap: Throughput (cycles/second) vs. number of proces- 
sors while think = 1000, different values of parameters buffersize and strictness 


other, which means the parameters do not affect the throughput too much. 


5.4 Comparing Different Concurrent Priority Queues 


We have seen how the parallel Fibonacci heap and the concurrent pool perform on differ- 
ent parameters. Here, we consider how they compare with each other, and how they com- 
pare with other kinds of concurrent priority queues, such as the concurrent binary heap. 
The concurrent binary heap compared here was developed by Rao and Kumar{RK88b]. 
They proposed a method of performing insert and delete operations concurrently in a 
top-down order on a balanced binary heap. The insert operation locks one node at a 
time, and the delete operation locks three nodes, a parent and two children, at a time. 
Their scheme has strict semantics for the extract operation, which means the extract 
operation always retrieves the most promising key. The problems with strict semantics 
have been discussed in Section 3.1. 

Figure 5.6 shows a comparison of the throughput of different priority queues: the 


66 


eae Segsize = 3 


Figure 5.3: Concurrent priority pool: think = 0, segsize = 3, different segnum 


sequential binary heap, the concurrent binary heap, the concurrent priority pool, and the 
parallel Fibonacci heap. Each operation on the sequential binary heap has an exclusive 
lock on the whole heap during the entire period of the operation. The parallel Fibonacci 
heap tested here is an average one, with bufferstze and strictness both equal to one. The 
concurrent priority pool tested has segnum equal to the number of processors, and segsize 
equal to 5. The graph shows that the throughput of the parallel Fibonacci heap and the 
concurrent priority pool are both linearly scalable, and that the concurrent priority pool 
has the largest throughput among these four priority queues. The concurrent binary 
heap’s throughput saturates when the number of workers is more than about eight. 
Since the sequential binary heap holds a lock on the entire heap during an operation, its 
throughput decreases as the number of processor increases. Because all the insert and 
extract operations of a concurrent binary heap both have to go through and lock the 
tree root, the tree root becomes a bottleneck when the number of processes accessing 
the concurrent binary heap increases. This bottleneck problem is reflected in Figure 5.6, 
which shows that the throughput of a concurrent binary heap saturates quickly. Overall, 
the concurrent binary heap is not as scalable and efficient as either the parallel Fibonacci 
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Figure 5.4: Concurrent priority pool: think = 0, segsize = 5, different segnum 


heap or the concurrent priority pool. 
Figure 5.7 shows the comparison when think = 1000. The contention on the priority 
queues is less than that of think = 0; this helps slow down the saturation of the less 


scalable priority queues. 


5.5 Applications 


Two kinds of applications of concurrent priority queues are presented in this section. One 
is the single source shortest path problem which is in the computational class P. The 
other one is the vertex cover problem which is in the computational class NP-complete. 


5.5.1 Single Source Shortest Path Problem 


The single source shortest path problem is as follows: given a source vertex s in a weighted 
graph G =< V,E >, find a path of minimum weight from s to every v € V. We choose 
Dijkstra’s algorithm as our basis[CLR90]. As shown in Figure 5.8, we keep a priority 
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Figure 5.5: Concurrent priority pool: think = 0, segsize = 7, different segnum 


queue Q of vertices in V. The priority of a vertex in Q is its distance from the source 
vertex s. The algorithm always chooses the vertex u that is the closest to s to add into 
S. For each vertex u’s neighbor v, we check if a shorter path has been found: if so, 
we update d[v] in line 11. Note that vertices are never added to Q, and each vertex is 
extracted from @ and added to S exactly once. 

The parallel single source shortest algorithm is presented in Figure 5.9. Independent 
workers work on a concurrent priority queue. These workers perform the same job as 
their sequential counterparts: extract a close vertex n from the queue and check all 
n’s neighbors to see if closer paths have been found. Unlike the sequential Dijkstra’s 
algorithm, when we extract a vertex from the concurrent priority queue, the vertex does 
not necessarily have to be the closest one from the source. In this way, a node may 
be inserted into the queue several times if a better path is found later on. However, 
the experiments show that on average each node is inserted no more than 1.3 times. 
Similarly, more decrease key operations are performed. 

This algorithm requires the use of the decrease key operation; since the decrease key 
operation cannot be effectively implemented on the concurrent binary heap, we only 
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Figure 5.6: Comparing different priority queues: think = 0 


compare the sequential binary heap, the parallel Fibonacci heap, and the concurrent 
priority pool. For the concurrent priority pool, the decrease key operation is implemented 
as a combination of delete and insert operations: first we delete the old key from the 
pool, then we insert the new key into the pool. In this way, a decrease key operation 
for the concurrent priority pool consists of two accesses whereas it is a simple operation 
with amortized constant cost for the parallel Fibonacci heap. In the implementations, 
we have kept track of where a key is in a priority queue to avoid searching when we do 
decrease key operations. 

Figure 5.10 shows the speedup graph of the single source shortest path problem. The 
graph has 1000 vertices and the degree of each vertex is randomly chosen from 0 to 
either 10 or 50. The sequential binary heap is used to compute speedup. The sequential 
program is very efficient (it is in computational class P) and always finds the shortest 
path to any vertex in shorter steps as compared to the case of concurrent priority queues 
where we do some extra work such as inserting a vertex in the queue several times and 
decreasing the distance of a vertex more often. As expected, the speedup ranges from 
around 0.3 with one processor to about 4.5 with fifteen processors. The parallel Fibonacci 
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Figure 5.7: Comparing different priority queues: think = 1000 


proc Dijkstra(G, s) 
% Find the shortest path from source s 


for each vertex v in V[G] 
do d[v] := co % initialize distance to be S\infty$ 


d{s} := 0 

S:= @ 

Q := V{G] 

while Q #@ do 
u := extract—min(Q) 
S := SUf{u} 


WOMWMNIADAAN SL Wr 


for each vertex v in Adj[u] do % relax edge (u, v) 
if d[{v] > d[uj + w{u, v] then 


d[v} := dfu] + wfu, v] % this is a decrease key operation 


end 


end Dijkstra 


Figure 5.8: Dijkstra’s single source shortest path algorithm 
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heap has slightly greater speedup than the concurrent priority pool on large number of 
processors (around ten). This could be caused by the fact that the decrease key operation 
on the parallel Fibonacci heap is more efficient. 


5.5.2 Vertex Cover Problem 


A vertex cover of an undirected graph G = (V, Z) is a subset V’ € V such that if (u,v) 
is an edge of G, then either u € V’ or v € V’ or both. The size of a vertex cover is the 
number of vertices in it. The vertex cover problem (VCP) is finding a minimal vertex 
cover for G [Vor87, PS82, CLR90, KRR88]. VCP is an NP-complete problem [CLR90]. 
As many other NP-complete problems, VCP can be attacked with branch-and-bound 
algorithms [LW66, jLW84, LS84]. 

Figure 5.11 shows a parallel branch-and-bound algorithm for VCP. In line 1 of the 
master procedure, an upper bound Cp of the VCP is found by using a greedy algorithm, 
i.e., picking vertices with larger degree first to get a cover. We start from an empty 
cover and fork off some workers to search the state space of the VCP. The priority queue 
Q keeps track of all the partial subcovers that have better lower bound than Cp. Each 
worker repeatedly takes subcovers out of Q and puts bigger subcovers that have lower 
bounds smaller than Co into Q until the smallest vertex cover is found. In the pseudocode 
for the workers, line 2 extracts a subcover C’. Line 6 finds a vertex z not in C that covers 
edges not already covered by C. We generate C’s two successors C, and C2 by either 
including z or excluding z in lines 7-10. Excluding z is equivalent to including all z’s 
neighbors into the cover. In line 11, we compute the lower bounds for the newly generated 
subcovers. A lower bound 6 for a subcover C' means that every vertex cover for G that 
contains C' will be of size at least b. Intuitively, 6 = |C| + the least number of vertices 
that have to be added into C to form a cover. We compute the second item by finding 
a match M of the graph uncovered by C*. Because at least one of the two endpoints of 
each edge in M has to be included in a vertex cover, b = |C|+|M|. In line 12, if we find 
a vertex cover that has better bound than the global bound Co, then we replace Co with 
the new cover. We insert subcovers that have better lower bound than Cp back into Q. 

Figure 5.12 shows the speedup graph of VCP on a 50 vertex graph with degree 
randomly chosen from 0 to either 10 or 16. The sequential binary heap is used as the 
basis to compute speedup. The concurrent priority pool and the parallel Fibonacci heap 


3A match is a set of independent edges, i.e., edges that do not share common vertex. We can use any 
kind of match to compute the lower bound here; the maximal match gives the best bound, but takes 
more time to find. In the experiments, a simple greedy match is used. 
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both have good scalable speedup whereas the concurrent binary heap saturates when 
the number of processors is more than ten. The graph also shows that the concurrent 
priority pool has slightly greater speedup than that of the parallel Fibonacci heap. Both 
the concurrent priority pool and the parallel Fibonacci heap have greater throughput 
when the degree upper bound of the vertices is bigger (i.e., 16 in the graph). The results 


are quite consistent with the synthetic data presented in the last few sections. 


5.6 Summary 


Some experimental results on different concurrent priority queues have been presented in 
this chapter. For the parallel Fibonacci heap, the parameters buffersize and strictness do 
not affect the running time much. In fact, the parallel Fibonacci heap performs fairly well 
in the quite strict case, when buf fersize = 1 and strictness = 1. For the concurrent 
priority pool, the effects of the parameters segnum and segstze do not seem to affect 
the throughput much either. The comparison of different concurrent priority queues, 
as shown in Figure 5.6, indicates that the parallel Fibonacci heap has linearly scalable 
throughput; the concurrent priority pool has the largest throughput and at the same 
time it has a linearly scalable performance. The throughput of the concurrent binary 
heap saturates when the number of processes accessing it is more than about eight. The 
sequential binary heap’s throughput decreases as the number of processors increases. 
Two different types of applications of concurrent priority queues, namely single source 
shortest path problem and vertex cover problem, have been implemented. The single 
source shortest path problem is in the computational class P and can be efficiently solved 
by using sequential binary heaps. Both the parallel Fibonacci heap and the concurrent 
priority pool have good scalable speedup, although it is around 0.3 with 1 processor and 
4.5 with 15 processors. The vertex cover problem is an NP-complete problem. Both the 
parallel Fibonacci heap and the concurrent priority pool have good scalable speedup. 
When the degrees of vertices in the graph are relatively large, the speedup is close to 
linear. The concurrent binary heap’s speedup saturates when the number of processors is 
more than about ten. The results on applications are quite consistent with the synthetic 


data. 
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%% pseudocode for the single source shortest path problem 
%% Find the shortest paths from source s to all other nodes in the graph 
%% Data structure: the graph is represented as an adjecent ... 


proc worker(q) 


loop 
n := extract—min(q) 
if n = nil then 
%% q is empty 
Termination test; see if the worker can quit 
else 
mark{n] := not-in—queue 
%% n has been taken out of q 
For each neighbor in adj[n] do 
lock neighbor 
if d(n) + w(n, neighbor) < d(neighbor) then 
if mark(neighbor] = not—in—queue then 
insert neighbor into q with new—distance 
else 
decrease—key(neighbor, new—distance) 
end 
end 
unlock neighbor 
end 
end 
end worker 
proc master 
Q:= 8 
Put source s in Q with priority 0 
Fork off some workers to work on q 
end master 


Figure 5.9: Parallel single source shortest path algorithm 
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Figure 5.10: The speedup graph for the SSSP problem 
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proc worker(Q) 


loop 
subcover := extract—min(Q) 
% subcover = (C, b) where C is the set of vertices and 
% 6 is the lower bound (i.e., the key in Q). 
if subcover = nil then 
% Q is empty 
Termination test; see if the worker can quit 
else 
Find a vertex x not in the cover C such that x covers 
edges that are not already covered by C 
Generate two subcovers C, and C2 
C; includes vertex x 
C2 includes x’s neighbors 
Compute the corresponding lower bounds b; and bg 
if one of the new subcovers forms a vertex cover that 
is smaller than the current cover Cy then 
replace the current cover with the new one 


if newly generated subcovers have better bound than the current 
one then insert them into Q 


end 
end worker 


proc master(G) 

Generate an initial cover Co using greedy algorithm 
Q := empty cover with bound 0 

Fork off some worker(Q) 


end master 


Figure 5.11: The branch-and-bound algorithm for the vertex cover problem 
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Figure 5.12: The speedup graph of the VCP 
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Chapter 6 


Conclusion and Future Directions 


6.1 Contributions 


This thesis presented two novel concurrent priority queues: the parallel Fibonacci heap 
and the concurrent priority pool, both of which have non-strict semantics (see section 
3.1). The parallel Fibonacci heap is based on the sequential Fibonacci heap, theoreti- 
cally the most efficient data structure for sequential priority queues. This scheme employs 
distributed small critical sections so that it has linearly scalable throughput. The experi- 
mental results in Chapter 5 showed that the parallel Fibonacci heap has linearly scalable 
throughput that is larger than that of the concurrent binary heap with even small num- 
ber of processors. A concurrent access scheme for a doubly linked list was described as 
part of the Fibonacci heap. 

The concurrent priority pool, based on the concurrent B-tree and the concurrent pool, 
has the largest throughput among all of the priority queues tested, besides providing 
a easy way to control the quality of extracted nodes. The experiments showed that 
the concurrent priority pool also has linearly scalable throughput. The three kinds of 
concurrent priority queues, namely the parallel Fibonacci heap, the concurrent priority 
pool, and the concurrent binary heap, were evaluated on an Encore machine using the 
language Mul-T. 

Two different types of applications of concurrent priority queues have been tested. 
One is the single source shortest path problem, which belongs to the computational class 
P. The other one is the vertex cover problem, an NP-complete problem. The results show 
that the parallel Fibonacci heap and the concurrent priority pool both have good scalable 
speedup on the applications whereas the concurrent binary heap saturates quickly. The 


speedup is larger on VCP than on SSSP. 
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6.2 Future Directions 


6.2.1 More experiments 


More experiments will be done when the simulator asim becomes practically usable. 


6.2.2 Distributed Memory Model 


The concurrent priority queues discussed in this thesis are mainly based on the shared 
memory model. Here, we discuss see how they can be modified to use a distributed 
memory model. 

The parallel Fibonacci heap is nicely divided into many sections. In a distributed 
memory model, each processor can have a section in its local memory and the promising 
list may be replicated. The promising list does not have to be updated synchronously on 
all processors. The insert operation can insert in the process’ local section, or randomly 
pick up a remote section to insert in depending on the network communication cost. 
The extract operation first tries to extract a local promising node. If there are no local 
promising nodes, the extract process finds remote promising nodes through the promising 
list. If the consolidation process finds that the quality of local nodes is not as good as 
nodes at remote processors, then some trees can be moved to balance the quality of nodes 
on different processors. Since a parallel Fibonacci section is a forest of trees linked in a 
doubly linked list, it is easier to move data around than if a section were a binary heap. 

For the concurrent priority pool whose skeleton is a concurrent B-tree, we can imple- 
ment each B-tree interior node and segment as an object. Since all the insert operations 
go through the B-tree root, we may want to replicate interior nodes close to the root 
on different processors to diffuse the traffic on the upper part of the B-tree '. Similarly, 
because all extract-min operations go through the leftmost leaf, it would be desirable to 
put different segments in the leftmost leaf on different processors. 


6.2.3 Other Related Research 


Kumar et al [KRR88] introduced several distributed binary heaps. They used three kinds 
of communication methods among processors to balance load: blackboard, random, and 
ring, and pointed out that the blackboard approach is the best. 


!This problem is examined in Paul Wang’s thesis[Wan90]. 
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Driscoll et al [DGST88] have proposed a parallel priority queue for SIMD machines 
that is called a “relaxed heap”. Van Emde Boas presented sequential priority queues 
{vEB75] that support insert, extract, delete and other operations in worst-case time 
O(lg lg n), if all the keys in the priority queue are restricted in the set {1, 2, ..., n}. It 
would be interesting to see if a more efficient parallel priority queue can be built using 
this as a base. 
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